Efficient and effective ER with progressive blocking
Authors
Abstract
Blocking is a mechanism to improve the efficiency of entity resolution (ER), which aims to quickly prune out all non-matching record pairs. However, depending on the distributions of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness-efficiency trade-off is revealed only when the output of ER starts to be available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we get the desired trade-off, leveraging a limited amount of ER results as guidance at every round. We formally prove that pBlocking converges efficiently ($O(n \log^2 n)$ time complexity, where $n$ is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by $5\times$ and 60%, respectively, improving the overall F-score of the entire ER process by up to 60%.
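The feedback loop described in the abstract can be illustrated with a small Python sketch. This is not the authors' pBlocking algorithm, only a toy illustration under stated assumptions: the bootstrap step is plain token blocking, the matcher is a hypothetical Jaccard-similarity threshold, the initial block score favors small blocks, and the feedback step re-scores each block by the fraction of its already-compared pairs that matched. The function names and the `rounds` and `budget` parameters are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def token_blocks(records):
    """Bootstrap step (assumed): one block per token, i.e. token blocking."""
    blocks = defaultdict(set)
    for rid, text in records.items():
        for tok in set(text.lower().split()):
            blocks[tok].add(rid)
    # A block with a single record generates no candidate pairs.
    return {k: v for k, v in blocks.items() if len(v) > 1}


def jaccard_match(a, b, threshold=0.5):
    """Stand-in matcher (hypothetical): Jaccard similarity over tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) >= threshold


def progressive_blocking(records, rounds=3, budget=10):
    """Toy feedback loop: compare a few pairs, then re-score the blocks."""
    blocks = token_blocks(records)
    # Initial score (assumed heuristic): smaller blocks are likelier to be clean.
    scores = {k: 1.0 / len(v) for k, v in blocks.items()}
    compared, matches = set(), set()
    for _ in range(rounds):
        # Score each uncompared pair by the best block that contains it.
        pair_score = {}
        for key, members in blocks.items():
            for pair in combinations(sorted(members), 2):
                if pair not in compared:
                    pair_score[pair] = max(pair_score.get(pair, 0.0), scores[key])
        if not pair_score:
            break
        # Spend a limited comparison budget on the highest-scoring pairs.
        top = sorted(pair_score, key=lambda p: (-pair_score[p], p))[:budget]
        for a, b in top:
            compared.add((a, b))
            if jaccard_match(records[a], records[b]):
                matches.add((a, b))
        # Feedback: a block's new score is its observed match rate so far.
        for key, members in blocks.items():
            seen = [p for p in combinations(sorted(members), 2) if p in compared]
            if seen:
                scores[key] = sum(p in matches for p in seen) / len(seen)
    return matches


if __name__ == "__main__":
    demo = {
        1: "apple iphone 12",
        2: "iphone 12 apple",
        3: "samsung galaxy s21",
        4: "galaxy s21 samsung",
        5: "apple watch",
    }
    print(sorted(progressive_blocking(demo)))
```

With a small `budget`, blocks that produced matches in earlier rounds are ranked higher in later rounds, which is the data-driven refinement idea in miniature. The sketch makes no attempt to reproduce the convergence guarantee stated in the abstract.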
Similar resources
Reduction Effect of Effective Bandwidth and Blocking Rate with Dispersion
This paper analytically studies performance improvement with traffic dispersion for QoS guaranteed applications. In the analysis, we suppose that connection admission control based on the effective bandwidth is performed in packet network. Numerical results exhibit that traffic dispersion can greatly reduce the total effective bandwidth required by a source and the blocking rate of sources.
An Effective Hybrid Genetic Algorithm for Hybrid Flow Shops with Sequence Dependent Setup Times and Processor Blocking
Hybrid flow-shop or flexible flow shop problems have remained subject of intensive research over several years. Hybrid flow-shop problems overcome one of the limitations of the classical flow-shop model by allowing parallel processors at each stage of task processing. In many papers the assumptions are generally made that there is unlimited storage available between stages and the setup times a...
The past hospitalization and its association with suicide attempts and ideation in patients with MDD and comparison with BMD (depressed type) group
No abstract available.
Efficient Progressive Skyline Computation
In this paper, we focus on the retrieval of a set of interesting answers called the skyline from a database. Given a set of points, the skyline comprises the points that are not dominated by other points. A point dominates another point if it is as good or better in all dimensions and better in at least one dimension. We present two novel algorithms, Bitmap and Index, to compute the skyline of ...
Effective and efficient reuse with software libraries
Research in software engineering has shown that software reuse positively affects the competitiveness of an organization: the productivity of the development team is increased, the time-to-market is reduced, and the overall quality of the resulting software is improved. Today’s code repositories on the Internet provide a large number of reusable software libraries with a variety of functionalit...
Journal
Journal title: The VLDB Journal
Year: 2021
ISSN: 0949-877X, 1066-8888
DOI: https://doi.org/10.1007/s00778-021-00656-7